

DocVXQA: Context-Aware Visual Explanations for Document Question Answering

Souibgui, Mohamed Ali, Choi, Changkyu, Barsky, Andrey, Jung, Kangsoo, Valveny, Ernest, Karatzas, Dimosthenis

arXiv.org Artificial Intelligence

We propose DocVXQA, a novel framework for visually self-explainable document question answering. The framework is designed not only to produce accurate answers to questions but also to learn visual heatmaps that highlight contextually critical regions, thereby offering interpretable justifications for the model's decisions. To integrate explanations into the learning process, we quantitatively formulate explainability principles as explicit learning objectives. Unlike conventional methods that emphasize only the regions pertinent to the answer, our framework delivers explanations that are contextually sufficient while remaining representation-efficient. This fosters user trust while achieving a balance between predictive performance and interpretability in DocVQA applications. Extensive experiments, including human evaluation, provide strong evidence supporting the effectiveness of our method. The code is available at https://github.com/dali92002/DocVXQA.
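The two explanation properties above can be folded into a single training objective. The following is a minimal sketch under assumed names and weights (`lambda_suff`, `lambda_eff` are illustrative, not the authors' actual implementation): the sufficiency term requires that the answer remain predictable from the masked regions alone, and the efficiency term keeps the heatmap compact.

```python
# Hypothetical sketch of a sufficiency + efficiency objective for a
# region mask; names and weights are assumptions, not the paper's code.

def explainability_loss(mask, answer_loss, lambda_suff=1.0, lambda_eff=0.1):
    """Combine the task loss with two explanation terms.

    mask        -- per-region relevance scores in [0, 1]
    answer_loss -- task loss computed from the masked input alone
    """
    sufficiency = answer_loss            # answering from the mask must still work
    efficiency = sum(mask) / len(mask)   # fraction of regions kept: smaller is better
    return lambda_suff * sufficiency + lambda_eff * efficiency

# Keeping 1 of 4 regions while still answering well yields a small loss.
loss = explainability_loss([1, 0, 0, 0], answer_loss=0.5)
```

Minimizing this trades answer quality from the masked input against mask size, which is exactly the sufficiency/efficiency balance the abstract describes.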


Review for NeurIPS paper: Model Rubik's Cube: Twisting Resolution, Depth and Width for TinyNets

Neural Information Processing Systems

This paper still considers only the resolution, depth, and width dimensions, which have already been studied in EfficientNet. Although the discovery in this paper that "resolution and depth are more important than width for tiny networks" differs from the conclusion in EfficientNet, I feel this point is not significant enough and reads like a supplement to EfficientNet. I'm not saying that this kind of method is not good, but I think the insights and intuitions for why resolution and depth are more important than width for small networks (derived in this way) are still not clear. In my opinion, this paper is basically doing random search by shrinking the EfficientNet-B0 structure configurations along the three mentioned dimensions. I believe the derived observation is useful, but the method itself offers very limited value to the community. Even a simple search method like evolutionary search could achieve a similar or identical result in a more efficient way.


Relevant Region Sampling Strategy with Adaptive Heuristic for Asymptotically Optimal Path Planning

Li, Chenming, Meng, Fei, Ma, Han, Wang, Jiankun, Meng, Max Q. -H.

arXiv.org Artificial Intelligence

Sampling-based planning algorithms are a powerful tool for solving planning problems in high-dimensional state spaces. In this article, we present a novel approach to sampling in the most promising regions, which significantly reduces planning time. The RRT# algorithm defines the Relevant Region based on the cost-to-come provided by the optimal forward-searching tree. However, it uses the cumulative cost of a direct connection between the current state and the goal state as the cost-to-go. To improve path planning efficiency, we propose a batch sampling method that samples in a refined Relevant Region with a direct sampling strategy, which is defined according to the optimal cost-to-come and an adaptive cost-to-go, taking advantage of various sources of heuristic information. The proposed sampling approach allows the algorithm to grow the search tree toward the most promising area, resulting in superior initial solution quality and reduced overall computation time compared to related work. To validate the effectiveness of our method, we conducted several simulations in both $SE(2)$ and $SE(3)$ state spaces. The simulation results demonstrate the superiority of the proposed algorithm.
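The core idea of sampling only in a relevant region can be sketched as rejection sampling: keep a candidate state only if its optimistic cost-to-come plus a cost-to-go heuristic could still beat the best solution found so far. The Euclidean heuristic and all names below are assumptions for illustration, not the paper's exact adaptive formulation.

```python
import math
import random

def relevant_region_batch(start, goal, c_best, n=100, bounds=(0.0, 10.0), seed=0):
    """Sample n 2D states whose estimated solution cost can improve on c_best."""
    rng = random.Random(seed)
    batch = []
    while len(batch) < n:
        x = (rng.uniform(*bounds), rng.uniform(*bounds))
        g = math.dist(start, x)   # optimistic cost-to-come (straight line)
        h = math.dist(x, goal)    # admissible cost-to-go heuristic
        if g + h < c_best:        # x lies inside the relevant region
            batch.append(x)
    return batch

# Every returned sample could lie on a path shorter than the current best.
samples = relevant_region_batch((0.0, 0.0), (10.0, 10.0), c_best=20.0, n=10)
```

With a Euclidean heuristic this region is the familiar informed-sampling ellipse; the refinement described in the abstract replaces the plain straight-line cost-to-go with an adaptive estimate.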


Picture Perfect - Hackster.io

#artificialintelligence

As machine learning algorithms continue to advance, the need for good, accurately annotated datasets is becoming increasingly apparent. With less and less room for optimization of the models themselves, more attention is finally being turned to addressing issues with data quality. After all, no matter how much potential a particular model has, that potential cannot be realized without a good dataset to learn from. Image classification is a common task for machine learning models, and these models suffer from a particular type of data problem called co-occurrence bias. Co-occurrence bias can cause irrelevant details to get the attention of a machine learning model, leading to incorrect predictions. For example, if a dataset used to train an object recognition model only contains images of boats in the ocean, the model may start classifying anything related to the ocean, such as beaches or waves, as boats.
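The boat/ocean shortcut above is easy to quantify: count how often a label and a background context appear together. The tiny annotated dataset below is invented for the example.

```python
from collections import Counter

# Toy (label, scene-context) annotations, invented for illustration.
annotations = [
    ("boat", "ocean"), ("boat", "ocean"), ("boat", "ocean"),
    ("boat", "harbor"), ("car", "street"), ("car", "street"),
]

def cooccurrence_rate(label, context, data):
    """Fraction of images with `label` that also contain `context`."""
    with_label = [ctx for lbl, ctx in data if lbl == label]
    return Counter(with_label)[context] / len(with_label)

print(cooccurrence_rate("boat", "ocean", annotations))  # 0.75
```

A rate near 1.0 signals exactly the kind of co-occurrence bias the article warns about: the background becomes nearly as predictive of the label as the object itself.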


Finding Short Signals in Long Irregular Time Series with Continuous-Time Attention Policy Networks

Hartvigsen, Thomas, Thadajarassiri, Jidapa, Kong, Xiangnan, Rundensteiner, Elke

arXiv.org Artificial Intelligence

Irregularly-sampled time series (ITS) are native to high-impact domains like healthcare, where measurements are collected over time at uneven intervals. However, for many classification problems, only small portions of long time series are relevant to the class label. In this case, existing ITS models often fail to classify long series since they rely on careful imputation, which easily over- or under-samples the relevant regions. Based on this insight, we propose CAT, a model that classifies multivariate ITS by explicitly seeking highly relevant portions of an input series' timeline. CAT achieves this by integrating three components: (1) a Moment Network learns to seek relevant moments in an ITS's continuous timeline using reinforcement learning; (2) a Receptor Network models the temporal dynamics of both observations and their timing, localized around predicted moments; (3) a recurrent Transition Model captures the sequence of transitions between these moments, cultivating a representation with which the series is classified. Using synthetic and real data, we find that CAT outperforms ten state-of-the-art methods by finding short signals in long irregular time series.
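The "receptor" idea of reading observations localized around a predicted moment can be shown in a few lines. The window width, names, and the toy series below are assumptions for illustration, not CAT's actual architecture.

```python
def receptor_window(timestamps, values, t, width=1.0):
    """Return (time offset from t, value) pairs observed within `width` of moment t.

    Offsets are rounded for readability; keeping the offset (not just the
    value) preserves the irregular timing information around the moment.
    """
    return [(round(ts - t, 3), v)
            for ts, v in zip(timestamps, values)
            if abs(ts - t) <= width]

# A toy irregular series: unevenly spaced timestamps with one observation each.
ts = [0.0, 0.7, 2.4, 3.1, 9.8]
vs = [1, 2, 3, 4, 5]
print(receptor_window(ts, vs, t=2.8))  # [(-0.4, 3), (0.3, 4)]
```

Only the two observations near the predicted moment t = 2.8 are returned, with their timing offsets, which is the locality the abstract describes.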


Explaining machine learning models for age classification in human gait analysis

Slijepcevic, Djordje, Horst, Fabian, Simak, Marvin, Lapuschkin, Sebastian, Raberger, Anna-Maria, Samek, Wojciech, Breiteneder, Christian, Schöllhorn, Wolfgang I., Zeppelzauer, Matthias, Horsak, Brian

arXiv.org Artificial Intelligence

Machine learning (ML) models have proven effective in classifying gait analysis data, e.g., binary classification of young vs. older adults. ML models, however, fall short in providing human-understandable explanations for their predictions. This "black-box" behavior impedes understanding of which input features the model predictions are based on. We investigated an Explainable Artificial Intelligence method, i.e., Layer-wise Relevance Propagation (LRP), for gait analysis data. The research question was: which input features are used by ML models to classify age-related differences in walking patterns? We utilized a subset of the AIST Gait Database 2019 containing five bilateral ground reaction force (GRF) recordings per person during barefoot walking of healthy participants. Each input signal was min-max normalized before concatenation and fed into a Convolutional Neural Network (CNN). Participants were divided into three age groups: young (20-39 years), middle-aged (40-64 years), and older (65-79 years) adults. The classification accuracy and relevance scores (derived using LRP) were averaged over a stratified ten-fold cross-validation. The mean classification accuracy of 60.1% was clearly higher than the zero-rule baseline of 37.3%. The confusion matrix shows that the CNN distinguished younger and older adults well, but had difficulty modeling the middle-aged adults.
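The min-max normalization step mentioned above is a standard rescaling to [0, 1]. A pure-Python stand-in for the pipeline's preprocessing, with invented GRF sample values:

```python
def min_max_normalize(signal):
    """Rescale a 1D signal to [0, 1]; a constant signal maps to all zeros."""
    lo, hi = min(signal), max(signal)
    if hi == lo:                       # avoid division by zero
        return [0.0] * len(signal)
    return [(x - lo) / (hi - lo) for x in signal]

grf = [120.0, 480.0, 300.0, 660.0]    # toy vertical GRF samples in newtons
normalized = min_max_normalize(grf)    # min maps to 0.0, max maps to 1.0
```

Normalizing each signal independently before concatenation keeps channels with different dynamic ranges (e.g., vertical vs. shear GRF components) on a comparable scale for the CNN.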


Knowing What VQA Does Not: Pointing to Error-Inducing Regions to Improve Explanation Helpfulness

Ray, Arijit, Cogswell, Michael, Lin, Xiao, Alipour, Kamran, Divakaran, Ajay, Yao, Yi, Burachas, Giedrius

arXiv.org Artificial Intelligence

Attention maps, a popular heatmap-based explanation method for Visual Question Answering (VQA), are supposed to help users understand the model by highlighting portions of the image/question used by the model to infer answers. However, we find that users are often misled by current attention map visualizations, which point to relevant regions even when the model produces an incorrect answer. Hence, we propose Error Maps that clarify the error by highlighting image regions where the model is prone to err. Error maps can indicate when a correctly attended region may be processed incorrectly, leading to an incorrect answer, and hence improve users' understanding of those cases. To evaluate our new explanations, we further introduce a metric that simulates users' interpretation of explanations to estimate their potential helpfulness for understanding model correctness. Finally, we conduct user studies showing that our new explanations help users understand model correctness better than baselines by an expected 30%, and that our proxy helpfulness metrics correlate strongly (ρ > 0.97) with how well users can predict model correctness.
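A proxy metric of this kind can be sketched as follows: a simulated user predicts the model is correct whenever the error map places little mass on the answer region, and helpfulness is the agreement with the model's actual correctness. The threshold and all data below are invented, not the paper's metric.

```python
def proxy_helpfulness(error_scores, model_correct, threshold=0.5):
    """Fraction of cases where a simulated user's prediction
    (error score below threshold => model correct) matches reality."""
    predictions = [score < threshold for score in error_scores]
    agree = sum(p == c for p, c in zip(predictions, model_correct))
    return agree / len(model_correct)

scores  = [0.1, 0.9, 0.4, 0.8]          # error-map mass on the answered region
correct = [True, False, True, True]     # whether the model actually answered right
print(proxy_helpfulness(scores, correct))  # 0.75
```

An explanation is helpful, under this proxy, exactly when reading it lets the simulated user tell correct answers from incorrect ones.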


A negative case analysis of visual grounding methods for VQA

Shrestha, Robik, Kafle, Kushal, Kanan, Christopher

arXiv.org Artificial Intelligence

Existing Visual Question Answering (VQA) methods tend to exploit dataset biases and spurious statistical correlations, instead of producing right answers for the right reasons. To address this issue, recent bias mitigation methods for VQA propose to incorporate visual cues (e.g., human attention maps) to better ground the VQA models, showcasing impressive gains. However, we show that the performance improvements are not a result of improved visual grounding, but a regularization effect which prevents over-fitting to linguistic priors. For instance, we find that it is not actually necessary to provide proper, human-based cues; random, insensible cues also result in similar improvements. Based on this observation, we propose a simpler regularization scheme that does not require any external annotations and yet achieves near state-of-the-art performance on VQA-CPv2.
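The observation that random cues work as well as human ones can be illustrated with a generic cue-based penalty: the term has the same form whichever cue map is plugged in. Names and the squared-error form below are assumptions for the sketch, not the paper's loss.

```python
import random

def grounding_penalty(model_attention, cue):
    """Mean squared difference between the model's attention map and a cue map."""
    return sum((a - c) ** 2 for a, c in zip(model_attention, cue)) / len(cue)

attention = [0.7, 0.2, 0.1]                       # model's attention over 3 regions
human_cue = [0.8, 0.1, 0.1]                       # human attention annotation
random_cue = [random.random() for _ in attention]  # insensible random cue

# Either cue yields a well-formed regularization term to add to the VQA loss;
# the paper's finding is that training gains appear even with the random one,
# suggesting the benefit is regularization rather than grounding.
loss_human = grounding_penalty(attention, human_cue)
loss_random = grounding_penalty(attention, random_cue)
```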


Collaborative Autonomy through Analogical Comic Graphs

Klenk, Matthew Evans (Palo Alto Research Center) | Mohan, Shiwali (Palo Alto Research Center) | Kleer, Johan de (Palo Alto Research Center) | Bobrow, Daniel G. (Palo Alto Research Center) | Hinrichs, Tom (Northwestern University) | Forbus, Ken (Northwestern University)

AAAI Conferences

For more effective collaboration, users and autonomous systems should interact naturally. We propose that sketch-based interaction coupled with qualitative representations and analogy provides a natural interface for users and systems. We introduce comic graphs that capture tasks in terms of the temporal dynamics of the spatial configurations of relevant objects. This paper demonstrates, through a strategy simulation example, how these models could be learned by demonstration, transferred to new situations, and enable explanations.